Abstract

The growing language technology indus- 
try needs measurement tools to allow re- 
searchers, engineers, managers, and cus- 
tomers to track development, evaluate 
and assure quality, and assess suitability 
for a variety of applications. 

The TSNLP (Test Suites for Natural Language
Processing) project has investigated various
aspects of the construction, maintenance and
application of systematic test suites as
diagnostic and evaluation tools for NLP
applications. The paper summarizes the motivation
and main results of TSNLP: besides the solid
methodological foundation of the project, TSNLP
has produced substantial (i.e. larger than any
existing general test suites) multi-purpose and
multi-user test suites for three European
languages together with a set of specialized
tools that facilitate the construction,
extension, maintenance, retrieval, and
customization of the test data.

As TSNLP results, including the data and
technology, are made publicly available,
the project presents a valuable linguistic
resource that has the potential of providing
a wide-spread pre-standard diagnostic and
evaluation tool for both developers and
users of NLP applications.

The project was started in December 1993 and
completed in March 1996; the consortium combines
strong expertise in machine translation, evaluation,
and natural language processing respectively and
includes Aerospatiale France as an industrial partner.

Most of the project results (documents,
bibliography, test data, and software) as well as
on-line access to the test suite database can be
obtained through the world-wide web from the TSNLP
home page http://tsnlp.dfki.uni-sb.de/tsnlp/.

1 Background and Motivation 

Evaluation of NLP applications plays an
increasingly important role in both the academic
and industrial NL communities. Two tools
traditionally used for evaluating and testing NLP
systems are test suites and test corpora. The two
can be seen as serving complementary purposes
(see Balkan et al. (1994) and Dauphin et al.
(1995a)): in contrast to text corpora, whose main
advantage is that they reflect naturally occurring
data, the key properties of test suites are
(i) systematicity, (ii) control over data,
(iii) inclusion of negative data, and
(iv) exhaustivity.

Among the main motivations for the TSNLP
project were the lack of general guidelines for
test suite construction, of adequate and
comprehensive test material, and of appropriate
tools. The resulting duplication of effort among
test suite developers wastes time and resources.
In addition, one of the main conclusions of a
study of existing test suites conducted during the
first stage of the project (Estival et al. (1994))
was that the reusability of existing test suites
is severely hampered by their lack of structure
and annotations. Indeed, despite the pioneering
efforts of Flickinger et al. (1987) and Nerbonne
et al. (1993), most of the existing test suites
were written for some specific system or simply
enumerate a number of interesting examples and,
thus, do not meet the demand for large,
systematic, well-documented, highly-structured and
annotated collections of linguistic material,
which is now required by a growing number of NLP
applications. The TSNLP test suite addresses these
demands and provides powerful tools for the
construction and manipulation of the test data.

The TSNLP project was funded within the Linguistic
Research Engineering (LRE) programme of the
European Commission (DG XIII) under research grant
LRE-62-089.

On the one hand, since every NLP system
(whether commercial or under development)
exhibits specific features which make it unique,
and every user (or developer) of an NLP system has
specific needs and requirements, the TSNLP
approach is based on the assumption that, in order
to yield informative and interpretable results,
any test suite used for an actual test or
evaluation must be specific (at least to some
degree) to the system and the user. On the other
hand, since testing or evaluating NLP systems is
performed for a variety of purposes, the TSNLP
approach is also guided by the need to provide
test material which is easily reusable. To achieve
these two goals of specificity and reusability,
the traditional notion of a test suite as a
monolithic set of test items has been abandoned in
favour of the notion of a database in which test
items are stored together with a rich inventory of
associated linguistic and non-linguistic
annotations.

Thus, the test suite database serves as a virtual 
(or meta) test suite that provides the means to ex- 
tract the relevant subset of the test data suitable 
for some specific task. Using the explicit struc- 
ture of the data and the TSNLP annotations, the 
database engine allows searching and retrieving 
data from the virtual test suite, thereby creating a 
concrete test suite instance according to arbitrary 
linguistic and extra-linguistic constraints. Since, 
additionally, there are tools provided for the main- 
tenance and extension of the test suite database, 
the TSNLP virtual test suite approach is an essen- 
tial innovation leading the way to a new genera- 
tion of highly-structured reusable test suites. 
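
The idea of deriving a concrete test suite instance from the virtual test suite can be illustrated with a minimal sketch. It is not the TSNLP database engine: the class Item, the function extract, the item ids, and the phenomenon labels attached to the sample sentences are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One annotated test item (field names are illustrative, not the TSNLP schema)."""
    item_id: int          # invented identifier
    text: str
    language: str
    wellformed: bool
    phenomena: frozenset  # names of the phenomena the item is associated with


# A tiny stand-in for the test suite database; ids and labels are invented.
DATABASE = [
    Item(1, "L'ingénieur vient.", "French", True, frozenset({"C_Agreement"})),
    Item(2, "L'ingénieur viens.", "French", False, frozenset({"C_Agreement"})),
    Item(3, "Der Manager arbeitet.", "German", True, frozenset({"C_Complementation"})),
    Item(4, "Der Manager arbeitet den Vortrag.", "German", False,
         frozenset({"C_Complementation"})),
    Item(5, "He saw the boy.", "English", True, frozenset({"Word Order"})),
]


def extract(database, *constraints):
    """Return the concrete test suite: all items that satisfy every constraint.

    Each constraint is a predicate over an Item, so linguistic and
    extra-linguistic conditions can be combined freely.
    """
    return [item for item in database if all(check(item) for check in constraints)]


# Concrete instance: ill-formed French items only.
french_negative = extract(
    DATABASE,
    lambda item: item.language == "French",
    lambda item: not item.wellformed,
)
print([item.text for item in french_negative])
```

Keeping the constraints as independent predicates mirrors the separation between stored annotations and task-specific selection: the same stored data yields different suites for different tasks.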

2 Test Suite Design and Methodology

Based on a survey of existing test suites and 
an analysis of the diagnostic and evaluation re- 
quirements of both NL technology developers and 
users, TSNLP has developed the methodology for
the construction of core test data, that is, test 
items reflecting central language phenomena and 
that are applicable to a wide range of applications, 
including parsers, grammar checkers, and con- 
trolled language checkers (Balkan et al. (1996)). 

The TSNLP methodology is designed to optimize
(i) control over test data, (ii) progressivity, and 
(iii) systematicity. These are necessary qualities 
for an adequate, reusable test suite, which are dif- 
ficult to find in test corpora. The methodology 
also addresses the specific goals of TSNLP to
produce multi-purpose, multi-user, and multilingual
test suites.

Control over test data What makes test 
suites valuable in comparison to corpora is that 
they can focus on specific linguistic phenomena 
and that each phenomenon can be presented both 
in isolation and controlled combinations in which 
as many linguistic parameters as possible are be- 
ing kept under control. This is particularly the 
case when a phenomenon is illustrated by system- 
atic variation over the parameters used to describe 
this phenomenon, while all other parts of the test 
items remain constant. 

Vocabulary is an aspect of the test data that 
needs to be controlled. TSNLP achieves this by
restricting the vocabulary in size as well as in 
domain. Categorially and semantically ambigu- 
ous words are avoided where possible and only 
included when ambiguity is explicitly tested for. 

Additionally, TSNLP attempts to control the
interaction of phenomena by keeping the test items
as small as possible. To this end, a number of
guidelines (such as "use declarative sentences"
and "avoid modifiers and adjuncts") are provided.

Progressivity Progressivity is the principle of
starting from simple test items and increasing
their complexity. In TSNLP, this aspect is
addressed by requiring that each test item focuses
only on a single phenomenon (or rather
subphenomenon or even feature) which distinguishes
it from all other test items. This principle not
only ensures systematicity during the test data
construction but also allows test data users to
apply the test data in a progressive order
resulting from the special attribute
presupposition in the phenomenon classification.
Thus, the precise identification of the coverage
of a system and of its deficiencies is rendered
easier.

Systematicity Systematicity refers to the depth
of coverage of a test suite, with respect to both
well-formed and ill-formed items. Systematicity in
TSNLP is achieved for well-formed
items by the explicit classification of test items ac- 
cording to phenomena and sub-phenomena. Nega- 
tive test data permits testing for overgeneration as 
well as for coverage. Ill-formed items are derived 
from well-formed ones by systematic variation of 
the parameters through the application of one (or 
more) of four operations, namely: 

• replacement (e.g. change of person inside
a verb in subject-verb agreement)
(French) L'ingénieur vient.
(French) *L'ingénieur viens.

• addition (e.g. addition of an object NP in a
sentence with an intransitive verb)
(German) Der Manager arbeitet.
(German) *Der Manager arbeitet den Vortrag.

• deletion (e.g. deletion of an obligatory
complement)
(German) Der Manager hält den Vortrag.
(German) *Der Manager hält.

• permutation (e.g. inverting the order of the
verb and the object)
(English) He saw the boy.
(English) *He the boy saw.

In general, the systematicity of test data was
greatly enhanced through the use of
special-purpose tools in the data construction and
validation process (see section 5 below).
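
The four operations can be sketched as functions over token lists. This is only an illustration of the principle and not the project's generation tool; the function names, the index-based interface, and the tokenization are assumptions.

```python
def replacement(tokens, index, new_token):
    """Replace the token at `index` (e.g. change the person of a verb)."""
    return tokens[:index] + [new_token] + tokens[index + 1:]


def addition(tokens, index, new_tokens):
    """Insert `new_tokens` before position `index` (e.g. add an object NP)."""
    return tokens[:index] + list(new_tokens) + tokens[index:]


def deletion(tokens, start, end):
    """Remove the tokens in the half-open span [start, end)."""
    return tokens[:start] + tokens[end:]


def permutation(tokens, order):
    """Reorder the tokens; `order` lists the old indices in their new order."""
    if sorted(order) != list(range(len(tokens))):
        raise ValueError("order must be a permutation of the token indices")
    return [tokens[i] for i in order]


# The examples from the list above, each derived from its well-formed source.
french = replacement(["L'", "ingénieur", "vient"], 2, "viens")
german_add = addition(["Der", "Manager", "arbeitet"], 3, ["den", "Vortrag"])
german_del = deletion(["Der", "Manager", "hält", "den", "Vortrag"], 3, 5)
english = permutation(["He", "saw", "the", "boy"], [0, 2, 3, 1])

for derived in (french, german_add, german_del, english):
    print("*" + " ".join(derived))
```

Because every function returns a new list, one well-formed item can serve as the source of several ill-formed variants, which corresponds to a test set with one positive and one or more negative examples.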

Multilinguality Multilinguality is achieved in
the TSNLP test suites by covering the same range
of phenomena in English, French and German,
and adopting the same classification for these
phenomena in the three languages. Furthermore, the
choice of related terminology for the categorial
and structural description contributes to the
comparability and consistency of the test items
(see section 4 for details).

Documentation To enhance the usability and 
extensibility of TSNLP results, a three-volume user
guide is under preparation providing clear instruc- 
tions for the assessment of the methodology, test 
data, and tools developed. 

3 TSNLP Annotation Schema

Following its survey of existing test suites and 
guidelines for the test suite construction, TSNLP
designed a detailed annotation schema for the test 
data which does not presuppose a specific linguis- 
tic theory (where this exists), a particular evalu- 
ation situation or application type. 

Test data and annotations in TSNLP test suites 
are organized at four distinct representational lev- 
els: 

• Core Data The core of the test data consists 
in the individual test items together with all 
general, categorial and structural information 
that is independent of a token phenomenon or 
application. Besides the actual input string, 
annotations at this level include (i) bookkeep- 
ing and documentation information (author, 
date, id number), (ii) the item format, its
length, category and well-formedness code,
(iii) the (morpho-)syntactic categories and 
string positions of the lexical and phrasal el- 
ements constituting the test item, and (iv) 
an (underspecified) representation of its func- 
tional structure. Encoding a dependency or 
functor-argument graph rather than a phrase 
structure tree allows generalizations over po- 
tentially controversial phrase structure con- 
figurations and, thus, avoids imposing a spe- 
cific constituent structure but still can be 
mapped onto one. 

• Phenomenon-Related Data Based on a
hierarchical classification of linguistic (and
extra-linguistic) phenomena (e.g. verb valency
as a subtype of general complementation), each
phenomenon is identified by a phenomenon id and
by its supertype(s). Interaction with other
phenomena as well as the phenomena which must be
presupposed are also given (see the paragraph on
progressivity in section 2). In addition, the
(syntactic) parameters which are relevant for the
phenomenon (e.g. the number and type of
complements in the case of verb valency) are
described. Individual test items can be assigned
to one or several phenomena and annotated
according to the corresponding parameters.

• Test Sets Test items can optionally be 
grouped into test sets. A test set is a group 
of test items containing typically one positive 
example and one or more negative examples. 
The relation between positive and negative 
test items has been one of the most challeng- 
ing questions in designing test data and, as 
has been mentioned, is based on the system- 
atic variation of phenomenon-specific param- 
eters. 

• User and Application Parameters Infor- 
mation that typically correlates with the use 
of a test suite for different types of evalua- 
tion and for different applications (e.g. rat- 
ings of frequency or relevance for a particu- 
lar domain) is factored from the remainder of 
the data into user & application profiles. As 
part of the customization process users of the 
TSNLP test suites are encouraged to extend
this part of the test suite database and add 
whatever (formal or informal) information is 
necessary for their specific requirements. 

In addition to the parts of the annotation
schema that follow a formal specification, there
is room for textual comments at the various levels
to accommodate information that cannot or need
not be formalized.

Test Item

item id: 24020101    author: issco       date: jan-95
register: formal     format: none        origin: invented
difficulty: 1        wellformedness: 1   category: S
input: L' ingénieur vient .              length: 3
comment:

Analysis

position  instance      category  function  domain
0:2       L' ingénieur  NPsg      subj      2:3
2:3       vient         V-3-sg    func      0:3

Phenomenon

phenomenon id: 2402  author: issco       date: jan-95
name: C_Complementation_subj(NP)_V
supertypes: C_Complementation
presupposition: C_Agreement, NP_Agreement
restrictions: neutral  interaction: none  purpose: test
comment: Intransitive verb (valency: 1)

Figure 1: Sample instance of the TSNLP annotation
schema for one test item: the annotations are
given in tabular form for the test item, analysis,
and phenomenon levels.
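
As a rough illustration of how the item and analysis levels of the sample in Figure 1 could be represented in a program, the following sketch uses plain record types. The class names, field names, and the consistency check positions_cover_item are assumptions for illustration; only the field values are taken from the figure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Analysis:
    """One row of the analysis level: a lexical or phrasal element."""
    position: Tuple[int, int]  # string positions (start, end)
    instance: str
    category: str
    function: str
    domain: Tuple[int, int]


@dataclass
class Item:
    """Core data of one test item (field names are illustrative)."""
    item_id: int
    author: str
    date: str
    register: str
    format: str
    origin: str
    difficulty: int
    wellformedness: int
    category: str
    input: str
    length: int
    comment: str = ""
    analyses: List[Analysis] = field(default_factory=list)


def positions_cover_item(item):
    """Hypothetical check: the analysis spans tile positions 0..length without gaps."""
    cursor = 0
    for start, end in sorted(a.position for a in item.analyses):
        if start != cursor or end <= start:
            return False
        cursor = end
    return cursor == item.length


ITEM = Item(
    item_id=24020101, author="issco", date="jan-95",
    register="formal", format="none", origin="invented",
    difficulty=1, wellformedness=1, category="S",
    input="L' ingénieur vient .", length=3,
    analyses=[
        Analysis((0, 2), "L' ingénieur", "NPsg", "subj", (2, 3)),
        Analysis((2, 3), "vient", "V-3-sg", "func", (0, 3)),
    ],
)

print(positions_cover_item(ITEM))
```

A check of this kind is one example of the limited consistency checking that a form-based construction tool could apply to field values; whether the project tools check exactly this property is not stated in the text.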

4 Test Data Construction 

Following the TSNLP test suite guidelines
(Estival et al. (1994)) and using the annotation
schema sketched above, the construction of test
data was based on a classification of the
(syntactic) phenomena to be covered. From
judgements on the linguistic relevance and
frequency for the individual languages, the
following list of core phenomena for TSNLP was
compiled:

• complementation; 

• agreement; 

• modification; 

• diathesis; 

• modality, tense, and aspect; 

• sentence and clause types; 

• word order; 

• coordination; 

• negation; and 

• extragrammatical (e.g. parentheticals and 
temporal expressions). 

A further sub-classification of phenomena is
made according to the relevant syntactic domains
in which a phenomenon occurs (e.g. sentences (S),
clauses (C), noun phrases (NP) etc.). Figure 2
gives an overview of the test material available.
For each of the three languages some 5000 test
items are provided. TSNLP has thus already
achieved a substantially broader and deeper
coverage than previous general-purpose test suites
(the still very popular Hewlett-Packard test
suite, for instance, has a coverage of 3000 test
items for English only).

Phenomenon              English             French          German
C_Complementation       148 | 863           [illegible]     [illegible]
C_Agreement             68 | 55             [illegible]     [illegible]
C_Modification          [illegible]         [illegible]     [illegible]
NP_Agreement            [illegible]         [illegible]     [illegible]
Tense Aspect Modality   39 | [illegible]    77 | 275        186 | 134
Sentence Types          80 | 100            389 | 387       105 | 14
Coordination            147 | 186           379 | 319       105 | 429
Negation                289 | 129           68 | 100        82 | 210
Word Order              - | -               7 | 7           60 | 160
Extragrammatical        24 | 34             -               253 | 10
Total                   1582 | 3036         2001 | 3130     1732 | 3308

Figure 2: Status of the TSNLP data (December
1995): relevance and breadth of individual
phenomena present language-specific variation (the
numbers given are for grammatical vs.
ungrammatical items). The cross-classification of
phenomenon names results from attaching the
syntactic domain as a prefix to the phenomenon
name (e.g. C_Complementation, NP_Agreement etc.).
Individual phenomena are often further
sub-classified according to phenomenon-internal
dimensions.

In order to enforce consistency of annotations 
across the three languages, canonical lists of the 
categories and functions used in the description of 
categorial and dependency structure were estab- 
lished. The dimensions chosen in the classification 
attempt to avoid the presupposition of very spe- 
cific assumptions of a particular theory of gram- 
mar (or of a language) , and rather try to capture 
those distinctions that seem to be relevant across 
the set of TSNLP core phenomena.

5 Test Suite Technology 

Because the test data construction proper as well
as the customization and application of a
general-purpose test suite to a specific NLP
system or domain are laborious, cost-intensive and
error-prone tasks, TSNLP put strong emphasis on
supplying suitable special-purpose tools to
facilitate both the development and the usage of
the TSNLP test data (Oepen et al. (1996) give an
overview).

Figure 3: Sketch of the modular tsdb1 design: the
database kernel is separated from client programs
through a layer of interface functions.

5.1 Test Data Construction 

To ease the time-consuming test data construction 
and to reduce erratic variations in filling in the 
TSNLP annotation schema, a graphical test suite
construction tool (tsct) was implemented. The 
tool instantiates the annotation schema (see
section 3) as a form-based input mask and provides
for (limited) consistency checking of the field val- 
ues. Additionally, tsct allows reusing previously 
constructed and annotated data, as quite often — 
when constructing a series of test items — it can 
be easier to duplicate and adapt a similar item 
rather than produce annotations from scratch. 

Additionally, for some of the test data a
DCG-based test suite generation tool (Arnold et
al. (1994)) was deployed to automatically produce
systematically varied (i.e. both grammatical and
ungrammatical) test items together with some part
of the annotations.

5.2 Test Data Maintenance and Retrieval 

To implement the TSNLP virtual test suite
approach (see section 1), the test data is mounted
on a relational database to satisfy the following 
key desiderata: 

• usability: to facilitate the application of the
methodology, technology, and test data
developed in TSNLP to a wide variety of diagnosis
and evaluation purposes for different
applications by developers or users with varied
backgrounds;

• suitability: to meet the specific necessities 
of storing and maintaining natural language 
test data (e.g. in string processing) and to 
provide maximally flexible interfaces; 

• adaptability and extensibility: to enable
and encourage users of the database to add
test data and annotations according to their
needs without changes to the underlying data
model; and

• portability and simplicity: to make the 
results of TSNLP available on several different 
hard- and software platforms and easy to use. 

To account for the potentially different
requirements of NLP developers and users and in
order to provide suitable interfaces to human test
suite users as well as to external application
programs, a dual database implementation was
carried out: (i) while a proprietary
implementation (called tsdb1) allowed the
fine-tuning of both the query language and
interfaces, (ii) a second version (tsdb2) builds
on a commercial database product and, thus, is
compliant with common industry standards, allowing
(industrial) users of the TSNLP test suite to
acquire on-site technical support where necessary.

The tsdb1 implementation is a small and
efficient relational database engine in ANSI C. It
was designed with an open and documented interface
layer (see figure 3) that enables test suite users
to bidirectionally link an application being
tested to the database and run automated retrieve,
process, and compare cycles. Diagnostic results
obtained can be stored into the database as part
of the user & application profile (see section 3)
for use in continuous progress evaluation (section
6 gives an example).

An ASCII-based command shell interprets a 
simplified SQL-style query language and provides 
editing, completion, and command and query re- 
sult history. A network database server gives re- 
mote (though read-only) access to the test data. 

For the alternative implementation tsdb2 the
competitively priced database package Microsoft
FoxPro was deployed because it is available for
both Apple Macintosh and personal computers
running MS Windows and has a very wide
distribution. The database provides convenient
graphical browsing and editing of the data (using
pull-down menus for finite domain fields) as well
as standard import and export facilities to
exchange data with external applications.

Building on the popular database package MS
Access, another implementation of the test suite
database is currently being developed. This version
will provide a similar functionality to tsdb2 and be
available for the MS Windows world.

5.3 Query and Retrieval: An Example

To illustrate the capacity and flexibility of the
TSNLP annotation schema in conjunction with a
relational database retrieval engine, a query
example in the simplified SQL-like query language
interpreted by tsdb1 together with an informal
English paraphrase is given:

• find all grammatical test items that are
associated with the phenomenon of clausal (i.e.
subject-verb) agreement and have pronominal
subjects:

select i-id i-input
where i-wf = 1 &
      p-name = "C_Agreement" &
      a-function = "subj" &
      a-category ~ "~PRON"
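
For readers more familiar with standard SQL, a roughly equivalent query can be sketched with Python's built-in sqlite3 module. This is not the tsdb1 engine or its schema: the relation layout, the underscore spelling of the attribute names, the sample rows and category labels, and the reading of the category condition as a prefix match on PRON are all assumptions.

```python
import sqlite3


def build_db():
    """Create an in-memory database with a hypothetical, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE item (i_id INTEGER PRIMARY KEY, i_input TEXT, i_wf INTEGER);
        CREATE TABLE analysis (i_id INTEGER, a_function TEXT, a_category TEXT);
        CREATE TABLE phenomenon (p_id INTEGER PRIMARY KEY, p_name TEXT);
        CREATE TABLE item_phenomenon (i_id INTEGER, p_id INTEGER);
    """)
    # Invented sample rows, not taken from the TSNLP data.
    conn.executemany("INSERT INTO item VALUES (?, ?, ?)", [
        (1, "He sleeps .", 1),
        (2, "The manager sleeps .", 1),
        (3, "He sleep .", 0),
    ])
    conn.executemany("INSERT INTO analysis VALUES (?, ?, ?)", [
        (1, "subj", "PRON-3-sg"), (1, "func", "V-3-sg"),
        (2, "subj", "NP-sg"), (2, "func", "V-3-sg"),
        (3, "subj", "PRON-3-sg"), (3, "func", "V-1-sg"),
    ])
    conn.execute("INSERT INTO phenomenon VALUES (1, 'C_Agreement')")
    conn.executemany("INSERT INTO item_phenomenon VALUES (?, 1)", [(1,), (2,), (3,)])
    return conn


# Grammatical items for clausal agreement whose subject is pronominal.
QUERY = """
    SELECT DISTINCT item.i_id, item.i_input
    FROM item
    JOIN item_phenomenon ON item_phenomenon.i_id = item.i_id
    JOIN phenomenon ON phenomenon.p_id = item_phenomenon.p_id
    JOIN analysis ON analysis.i_id = item.i_id
    WHERE item.i_wf = 1
      AND phenomenon.p_name = 'C_Agreement'
      AND analysis.a_function = 'subj'
      AND analysis.a_category LIKE 'PRON%'
    ORDER BY item.i_id
"""

conn = build_db()
rows = conn.execute(QUERY).fetchall()
print(rows)
```

The explicit joins in the sketch make visible what the simplified query language leaves implicit, namely that item, analysis, and phenomenon annotations live in separate relations linked by identifiers.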

6 Customization and Testing 

To validate the TSNLP annotation methodology,
test data, and tools, the project results have been 
tested against three different application types, 
viz. a commercial grammar checker for French, 
a controlled language checker (SECC) for English 
and a parser (the HPSG system developed at 
DFKI) for German. As in this setup the evalu- 
ation situations ranged from user-level black box 
evaluation of a commercial product to glass box 
diagnosis of a research prototype under develop- 
ment (the DFKI system), a number of interesting 
results were obtained on both the adequacy of the 
TSNLP approach as well as the quality of the
systems being tested.

French Grammar Checker The real life eval- 
uation scenario (i.e. the diagnostic evaluation of 
a commercial NLP product) enabled Aerospatiale 
to give a precise account of the type of information 
obtainable from the use of TSNLP. 

The following major performance characteris- 
tics were revealed: 

• TSNLP ill-formed test items are frequently not 
detected as such. 

• The system performs well on (both well- 
formed and ill-formed) test items illustrating 
the phenomenon of agreement, in clauses as 
well as in noun phrases. 

• The system does not master the phenomenon 
of complementation, especially not in adjec- 
tival phrases. 

• Sentential test items produce better results
than subsentential ones.



• The analysis capabilities of the system are 
limited (19% of the TSNLP test items were
not fully analysed). 

The interpretation of the results produced by 
the system and the comparison with the linguistic
information provided in the TSNLP annotations
led to an identification of the major shortcom- 
ings of the system in terms of systematicity, lex- 
ical and morpho-syntactic deficiencies, and inter- 
ference with other system components. 



English Controlled Language Checker Essex
tested the controlled language checker SECC
(Adriaens (1994)). Like Aerospatiale, Essex was
mostly in a black box situation with respect to the 
system, except that they had access to the con- 
trolled grammar language descriptions (but not to 
the system rules). The testing involved the writ- 
ing of a large number of customised test items, due 
to the fact that many CL rules are lexically based, 
whereas the core test suite concentrates on syntac- 
tic phenomena. The testing proved very valuable 
in highlighting deficiencies in the system perfor- 
mance, as well as in the rule descriptions and gave 
pointers to the possible source of those errors. 



German Parser In connecting the German
TSNLP test suite to the DFKI HPSG system, both
the test data and the TSNLP technology were
validated. Building on the C version of the TSNLP
database (tsdb1), a bidirectional interface to the
application was established allowing the
instantiation of a DFKI user & application profile
for the storage of application-specific data
(including performance measures and a semantic
specification of the expected output).

The seamless coupling between the test suite 
and the NL system allows running fully auto- 
mated retrieve, process, and compare cycles in the 
continuous progress evaluation of the grammar 
and software such that — after making changes 
to the system — the impact on coverage and 
performance can be determined in an overnight 
batch job. The TSNLP test data and database
technology proved to be a highly adequate tool
for glass-box diagnostic evaluation; besides, the
testing experience provided valuable feedback for
both the test suite and the application tested
(Dauphin et al. (1995b)).
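
A retrieve, process, and compare cycle of the kind described above can be sketched in a few lines. The sketch does not reproduce the tsdb1 interface or the DFKI system: run_cycle, compare, and the two toy stand-in systems are assumptions, and the items are the German examples from section 2.

```python
def run_cycle(items, accepts):
    """Process each retrieved item and record whether the system's verdict
    (accepted or rejected) agrees with the well-formedness annotation."""
    return {item_id: accepts(text) == wellformed
            for item_id, text, wellformed in items}


def compare(previous, current):
    """Compare two runs; return (regressions, improvements) as sorted id lists."""
    regressions = sorted(i for i in current
                         if previous.get(i, False) and not current[i])
    improvements = sorted(i for i in current
                          if not previous.get(i, False) and current[i])
    return regressions, improvements


# Retrieved test items: (id, input, well-formed?). The ids are invented.
ITEMS = [
    (1, "Der Manager arbeitet.", True),
    (2, "Der Manager arbeitet den Vortrag.", False),
    (3, "Der Manager hält den Vortrag.", True),
    (4, "Der Manager hält.", False),
]


def old_system(text):
    """Toy stand-in for an overgenerating system: accepts everything."""
    return True


def new_system(text):
    """Toy stand-in for a changed system: accepts only three-word inputs."""
    return len(text.split()) == 3


before = run_cycle(ITEMS, old_system)
after = run_cycle(ITEMS, new_system)
regressions, improvements = compare(before, after)
print("regressions:", regressions, "improvements:", improvements)
```

Storing each run keyed by item id is what makes the comparison step possible; in the setting described in the text, this role is played by the user & application profile in the database.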



Additional sample queries and more details on
the database schema (including relation and
attribute names) can be found in Oepen et al.
(1996) and on the TSNLP World-Wide Web home page
http://tsnlp.dfki.uni-sb.de/tsnlp/.



The DFKI HPSG system is a state-of-the-art NL
core engine and grammar engineering platform; it is 
in active use at several research institutions (includ- 
ing CSLI Stanford, Brandeis, Ohio State, and Simon 
Fraser Universities), primarily for HPSG-style gram- 
mar development for German, English, Japanese, and 
Italian. 